Threshold Choice Methods: the Missing Link
Many performance metrics have been introduced for the evaluation of
classification performance, with different origins and niches of application:
accuracy, macro-accuracy, area under the ROC curve, the ROC convex hull, the
absolute error, and the Brier score (with its decomposition into refinement and
calibration). One way of understanding the relation among some of these metrics
is the use of variable operating conditions (either in the form of
misclassification costs or class proportions). Thus, a metric may correspond to
some expected loss over a range of operating conditions. One dimension for the
analysis has been precisely the distribution we take for this range of
operating conditions, leading to some important connections in the area of
proper scoring rules. However, we show that there is another dimension which
has not received attention in the analysis of performance metrics. This new
dimension is given by the decision rule, which is typically implemented as a
threshold choice method when using scoring models. In this paper, we explore
many old and new threshold choice methods: fixed, score-uniform, score-driven,
rate-driven and optimal, among others. By calculating the loss of these methods
for a uniform range of operating conditions we get the 0-1 loss, the absolute
error, the Brier score (mean squared error), the AUC and the refinement loss
respectively. This provides a comprehensive view of performance metrics as well
as a systematic approach to loss minimisation, namely: take a model, apply
several threshold choice methods consistent with the information which is (and
will be) available about the operating condition, and compare their expected
losses. In order to assist in this procedure we also derive several connections
between the aforementioned performance metrics, and we highlight the role of
calibration in choosing the threshold choice method.
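The correspondence between threshold choice methods and familiar metrics can be checked numerically. The sketch below assumes one particular convention (not necessarily the paper's exact notation): the operating condition c in [0, 1] is the cost proportion assigned to false positives, the loss at c is twice the cost-weighted error, and the score-driven method sets the threshold equal to c. Under these assumptions, averaging the loss over a uniform range of c recovers the Brier score:

```python
import numpy as np

def score_driven_expected_loss(scores, labels, n_grid=10_001):
    """Expected cost-sensitive loss of the score-driven threshold choice
    method, averaged over a uniform range of operating conditions c.

    Convention (assumed for this sketch): c is the cost proportion of
    false positives, the loss at c is 2 * (c * FP + (1 - c) * FN), and
    the score-driven method predicts positive iff score > c.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    cs = np.linspace(0.0, 1.0, n_grid)
    losses = np.empty(n_grid)
    for i, c in enumerate(cs):
        pred_pos = scores > c                    # score-driven threshold t = c
        fp = np.mean((labels == 0) & pred_pos)   # false-positive mass
        fn = np.mean((labels == 1) & ~pred_pos)  # false-negative mass
        losses[i] = 2.0 * (c * fp + (1.0 - c) * fn)
    return np.mean(losses)                       # uniform average over c

# Under this convention the uniform average matches the Brier score:
scores = np.array([0.9, 0.7, 0.4, 0.2, 0.6])
labels = np.array([1, 1, 0, 0, 1])
brier = np.mean((scores - labels) ** 2)
```

A per-example integration confirms why: for a positive with score s the integral of the weighted loss over c is (1 - s)^2, and for a negative it is s^2, which are exactly the Brier score terms.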
Wind-sensitive Interpolation of Urban Air Pollution Forecasts
People living in urban areas are exposed to outdoor air pollution. Air contamination is linked to numerous premature and prenatal deaths each year. Urban air pollution is estimated to cost approximately 2% of GDP in developed countries and 5% in developing countries. Some works reckon that vehicle emissions produce over 90% of air pollution in cities in these countries. This paper presents some results in predicting and interpolating real-time urban air pollution forecasts for the city of Valencia in Spain. Although many cities provide air quality data, in many cases this information is presented with significant delays (three hours for the city of Valencia) and it is limited to the area where the measurement stations are located. We compare several regression models able to predict the levels of four different pollutants (NO, NO2, SO2, O3) in six different locations of the city. Wind strength and direction are key factors in the propagation of pollutants around the city, so we study different techniques to incorporate this factor into the regression models. Finally, we also analyse how to interpolate forecasts all around the city. Here, we propose an interpolation method that takes wind direction into account. We compare this proposal with well-known interpolation methods. By using these contamination estimates, we are able to generate a real-time pollution map of the city of Valencia.
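One generic way to make spatial interpolation wind-sensitive is to modulate inverse-distance weights by how well each station is aligned with the wind. The sketch below is an illustration of that idea only, not the method from the paper; the `alpha` parameter and the weighting formula are assumptions for the example:

```python
import math

def wind_aware_idw(x, y, stations, wind_dir, alpha=1.0, power=2.0):
    """Inverse-distance-weighted interpolation of pollutant levels,
    up-weighting stations that sit upwind of the query point.

    Generic illustration, not the paper's method. `stations` is a list
    of (sx, sy, value) tuples; `wind_dir` is the direction the wind
    blows towards, in radians; `alpha` (an assumption) controls how
    strongly wind alignment modulates the weights.
    """
    wx, wy = math.cos(wind_dir), math.sin(wind_dir)
    num = den = 0.0
    for sx, sy, value in stations:
        dx, dy = x - sx, y - sy
        dist = math.hypot(dx, dy)
        if dist == 0.0:
            return value                  # query point sits on a station
        # Alignment in [-1, 1]: +1 when the wind carries the station's
        # air directly towards the query point.
        align = (dx * wx + dy * wy) / dist
        w = (1.0 + alpha * (1.0 + align) / 2.0) / dist ** power
        num += w * value
        den += w
    return num / den
```

With the wind blowing from a low-pollution station towards the query point, the estimate is pulled towards that station's value relative to plain IDW.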
Technical Note: Towards ROC Curves in Cost Space
ROC curves and cost curves are two popular ways of visualising classifier
performance, finding appropriate thresholds according to the operating
condition, and deriving useful aggregated measures such as the area under the
ROC curve (AUC) or the area under the optimal cost curve. In this note we
present some new findings and connections between ROC space and cost space, by
using the expected loss over a range of operating conditions. In particular, we
show that ROC curves can be transferred to cost space by means of a very
natural way of understanding how thresholds should be chosen, by selecting the
threshold such that the proportion of positive predictions equals the operating
condition (either in the form of cost proportion or skew). We call these new
curves {ROC Cost Curves}, and we demonstrate that the expected loss as measured
by the area under these curves is linearly related to AUC. This opens up a
series of new possibilities and clarifies the notion of cost curve and its
relation to ROC analysis. In addition, we show that for a classifier that
assigns the scores in an evenly-spaced way, these curves are equal to the Brier
Curves. As a result, this establishes the first clear connection between AUC
and the Brier score.
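The threshold selection described here, choosing t so that the proportion of positive predictions equals the operating condition, amounts to taking a quantile of the scores. A minimal sketch, with ties and finite samples making the match only approximate:

```python
import numpy as np

def rate_driven_threshold(scores, r):
    """Rate-driven threshold choice: pick the threshold so that the
    proportion of positive predictions (score > t) equals the
    operating condition r. The (1 - r)-quantile of the scores makes
    roughly a fraction r of the examples positive; ties and finite
    samples mean the match is only approximate.
    """
    scores = np.asarray(scores, dtype=float)
    return np.quantile(scores, 1.0 - r)

scores = np.linspace(0.0, 1.0, 101)  # toy model with evenly-spaced scores
t = rate_driven_threshold(scores, 0.3)
rate = np.mean(scores > t)           # close to 0.3 positive predictions
```

The evenly-spaced toy scores also match the special case mentioned above, where the resulting curves coincide with the Brier curves.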
CASP-DM: Context Aware Standard Process for Data Mining
We propose an extension of the Cross Industry Standard Process for Data
Mining (CRISP-DM) which addresses specific challenges of machine learning and
data mining for context and model reuse handling. This new general
context-aware process model is mapped onto the CRISP-DM reference model,
proposing some new or enhanced outputs.
Missing the missing values: The ugly duckling of fairness in machine learning
Nowadays, there is an increasing concern in machine learning about the causes underlying unfair decision making, that is, algorithmic decisions discriminating some groups over others, especially groups that are defined over protected attributes such as gender, race and nationality. Missing values are one frequent manifestation of all these latent causes: protected groups are more reluctant to give information that could be used against them, sensitive information for some groups can be erased by human operators, or data acquisition may simply be less complete and systematic for minority groups. However, most recent techniques, libraries and experimental results dealing with fairness in machine learning have simply ignored missing data. In this paper, we present the first comprehensive analysis of the relation between missing values and algorithmic fairness for machine learning: (1) we analyse the sources of missing data and bias, mapping the common causes; (2) we find that rows containing missing values are usually fairer than the rest, which should discourage treating missing values as the uncomfortable ugly data that different techniques and libraries for handling algorithmic bias get rid of at the first occasion; (3) we study the trade-off between performance and fairness when the rows with missing values are used (either because the technique deals with them directly or by imputation methods); and (4) we show that the sensitivity of six different machine-learning techniques to missing values is usually low, which reinforces the view that the rows with missing data contribute more to fairness through the other, non-missing, attributes. We end the paper with a series of recommended procedures about what to do with missing data when aiming for fair decision making.
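The core concern, that discarding rows with missing values can disproportionately remove a protected group, is easy to demonstrate on synthetic data. The missingness mechanism below (group B withholding income far more often) is an assumed illustration of the causes discussed above, not data from the paper:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical synthetic data: group B is more reluctant to disclose
# income, so its rows carry missing values far more often (an assumed
# mechanism mirroring the sources of missingness described above).
n = 1000
group = rng.choice(["A", "B"], size=n, p=[0.7, 0.3])
income = rng.normal(50.0, 10.0, size=n)
miss_prob = np.where(group == "B", 0.5, 0.05)
income[rng.random(n) < miss_prob] = np.nan
df = pd.DataFrame({"group": group, "income": income})

# Listwise deletion (what many fairness pipelines silently do)
# shrinks group B's share of the data:
dropped = df.dropna()
share_b_before = (df["group"] == "B").mean()
share_b_after = (dropped["group"] == "B").mean()

# Simple mean imputation keeps every row, preserving representation:
imputed = df.assign(income=df["income"].fillna(df["income"].mean()))
```

This is only the representation side of the story; the abstract's points (2) to (4) concern how those retained rows then affect fairness and performance.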
An instantiation for sequences of hierarchical distance-based conceptual clustering
In this work, we present an instantiation of our framework for Hierarchical Distance-based Conceptual Clustering (HDCC) using sequences, a particular kind of structured data. HDCC is a general approach to conceptual clustering that extends the traditional algorithm for hierarchical clustering by producing conceptual generalizations of the discovered clusters. We analyze the relationship between distances and generalization operators for sequences in the context of HDCC. Since the approach is general, it allows combining the flexibility of changing distances for different data types while taking advantage of the interpretability offered by the obtained concepts, which is central for descriptive data mining tasks. We propose different generalization operators for sequences and analyze how they work together with the edit and linkage distances in HDCC. This analysis is carried out on the basis of three different properties for generalization operators and three different levels of agreement between the clustering hierarchy obtained from the linkage distance and the hierarchy obtained by using generalization operators.
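The two ingredients combined here, a distance and a generalization operator over sequences, can be sketched concretely. Below, the distance is the standard unit-cost edit (Levenshtein) distance, and the generalization operator is a toy one chosen for illustration (keep the longest common subsequence, mark unmatched stretches with a gap symbol); it is one possible operator, not necessarily those studied in the paper:

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (unit costs)."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # deletion
                         cur[j - 1] + 1,      # insertion
                         prev[j - 1] + cost)  # substitution / match
        prev = cur
    return prev[n]

def generalise(a, b, gap="*"):
    """Toy generalization operator for sequences: keep the longest
    common subsequence, collapsing unmatched stretches into a single
    gap symbol. Illustrative only."""
    m, n = len(a), len(b)
    # Longest-common-subsequence dynamic-programming table.
    L = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            L[i + 1][j + 1] = (L[i][j] + 1 if a[i] == b[j]
                               else max(L[i][j + 1], L[i + 1][j]))
    # Backtrack, emitting common elements and collapsing gaps.
    out, i, j = [], m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1]); i -= 1; j -= 1
        else:
            if not out or out[-1] != gap:
                out.append(gap)
            if L[i - 1][j] >= L[i][j - 1]:
                i -= 1
            else:
                j -= 1
    if (i > 0 or j > 0) and (not out or out[-1] != gap):
        out.append(gap)
    return "".join(reversed(out))
```

A concept such as `generalise("abcd", "abed")` covers both input sequences, which is the interpretability payoff the abstract emphasises, while `edit_distance` plays the role of the distance the clustering hierarchy is built from.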
Predictable Artificial Intelligence
We introduce the fundamental ideas and challenges of Predictable AI, a
nascent research area that explores the ways in which we can anticipate key
indicators of present and future AI ecosystems. We argue that achieving
predictability is crucial for fostering trust, liability, control, alignment
and safety of AI ecosystems, and thus should be prioritised over performance.
While distinctive from other areas of technical and non-technical AI research,
the questions, hypotheses and challenges relevant to Predictable AI had yet to
be clearly described. This paper aims to elucidate them, calls for identifying
paths towards AI predictability, and outlines the potential impact of this
emergent field.